The Structured Information Manager: A Database System for SGML Documents
Abstract
One of the important standards for document interchange and representation to have emerged is SGML, the Standard Generalized Markup Language. SGML is designed to capture the logical structure of documents, i.e. the logical components such as titles and paragraphs and their interrelationships. SGML is a complex standard, and the design of a database system for managing SGML documents poses many challenges. In this talk, we describe an SGML-conformant database system, called the Structured Information Manager (SIM), and illustrate how support for document structure can help in many important applications by describing how SIM has been deployed to provide public access to databases of legislation.

The Structured Information Manager (SIM) is a document database system designed to manage multigigabyte collections of documents containing unstructured text (ASCII), structured text (including SGML and MARC), binary objects (such as images and videos) and other kinds of data. As an information retrieval system, SIM provides a client-server model of processing and supports a wide range of user interface platforms, including command line, MS Windows, Macintosh, and X. SIM uses compressed inverted file technology for accessing large text collections using both query and browsing paradigms [ZobMof92]. Both Boolean and natural language queries are supported and response times are sub-second, even for multigigabyte databases.

SIM is standards based. It provides direct support for SGML, the international standard for document representation and interchange, and Z39.50, the international standard for client-server communication in information retrieval applications [SacArn95]. For Web access, an HTTP to Z39.50 translation is supported. Because SIM supports SGML directly, it can handle documents of arbitrary complexity, and a collection of documents can be treated as a database of information.
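To make the inverted-file idea mentioned above concrete, here is a minimal Python sketch of an inverted index with Boolean AND/OR lookup. It is an illustration only and not SIM's implementation: the function names, the whitespace tokenizer, and the sample documents are all assumptions. The `to_gaps` helper shows gap (delta) encoding of a postings list, a common first step in inverted-file compression; whether SIM uses this particular scheme is not stated in the abstract.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenizer (an assumption made for brevity).
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, terms):
    """Return sorted ids of documents that contain every query term."""
    sets = [set(index.get(t.lower(), ())) for t in terms]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


def boolean_or(index, terms):
    """Return sorted ids of documents that contain at least one query term."""
    result = set()
    for t in terms:
        result.update(index.get(t.lower(), ()))
    return sorted(result)


def to_gaps(doc_ids):
    """Gap-encode a sorted postings list: keep the first id, then differences."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


# Hypothetical sample collection, loosely themed on legislation.
docs = {
    1: "the act amends the tax law",
    2: "tax rates for small business",
    3: "the law on small business tax",
}
index = build_inverted_index(docs)
print(boolean_and(index, ["tax", "law"]))
print(boolean_or(index, ["rates", "amends"]))
print(to_gaps(index["tax"]))
```

The gaps are small integers, and small integers can be stored in fewer bits with a variable-length code, which is why postings lists are usually stored in this form before compression.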
SIM is supported and marketed in Australia and New Zealand by Ferntree Computer Corporation. Research and
Related resources
Docbase - a Database Environment for Structured Documents
Standard Generalized Markup Language (SGML) has been widely accepted as a standard for document representation. The strength of SGML lies in the fact that it embeds logical structural information in documents while preserving a human-readable form. This structural information in SGML documents allows processing of these documents using database techniques. SGML facilitates this goal by providin...
Extending SGML to Accommodate Database Functions: A Methodological Overview
* Partially supported by US Dept. of Education award number P200A502367 and NSF Research and Infrastructure grant, award number NSF CDA-9303189. Abstract A method for augmenting an SGML document repository with database functionality is presented. SGML [ISO 8879, 1986] has been widely accepted as a standard language for writing text with added structural information that gives the text greater ...
Standardizing the Querying Process with SGML The SQL DTD
One of the most exciting applications of SGML to have emerged in recent years is its use in document databases. The structural information embedded in SGML documents makes it possible to query SGML documents and extract information in an automatic manner; however, this querying process has not been standardized. As a result, different SGML database implementations use their own query lang...
Database Systems for Structured Documents
Documents stored in a database system can have complex internal structure described by languages such as SGML. How to take advantage of this structure presents challenges for database system implementors. We classify the types of queries that need to be supported by SGML-conformant database systems. We then describe several data models that have been proposed for representing documents in a dat...
Toward the Union of Databases and Document Management: The Design of DocBase
With the advent of the World Wide Web (WWW) and the increased use of electronic documents in almost all aspects of computing, the problems of management of and systematic information retrieval from electronic documents have become highly pertinent. Information retrieval (IR) techniques allow us to retrieve documents based on keywords, but often these searches are not powerful enough to accurate...